Duplicates, redundancies and inconsistencies in the primary nucleotide databases: a descriptive study
Abstract
GenBank, the EMBL European Nucleotide Archive and the DNA DataBank of Japan, known collectively as the International Nucleotide Sequence Database Collaboration or INSDC, are the three most significant nucleotide sequence databases. Their records are derived from laboratory work undertaken by different individuals and teams, with a range of technologies and assumptions, over a period of decades. As a consequence, they contain a great many duplicates, redundancies and inconsistencies, but neither the prevalence nor the characteristics of the various types of duplicates have been rigorously assessed. Existing duplicate detection methods in bioinformatics address only specific duplicate types, under inconsistent assumptions, and the impact of duplicates in bioinformatics databases has not been carefully assessed, making it difficult to judge the value of such methods. Our goal is to assess the scale, kinds and impact of duplicates in bioinformatics databases through a retrospective analysis of merged groups in INSDC databases. Our outcomes are threefold: (1) We analyse a benchmark dataset of duplicates manually identified in INSDC (67 888 merged groups with 111 823 duplicate pairs across 21 organisms) in terms of the prevalence, types and impacts of duplicates. (2) We categorize duplicates at both the sequence and the annotation level, with supporting quantitative statistics, showing that different organisms have different prevalence of distinct kinds of duplicate. (3) We show, via a simple case study on GC content and melting temperature, that the presence of duplicates has practical impact. We demonstrate that duplicates not only introduce redundancy, but can also lead to inconsistent results for certain tasks. Our findings lead to a better understanding of the problem of duplication in biological databases.
Database URL: the merged records are available at https://cloudstor.aarnet.edu.au/plus/index.php/s/Xef2fvsebBEAv9w.
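The GC content and melting temperature case study mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which melting temperature formula was used, so the Wallace rule (a rough estimate intended for short oligonucleotides) is assumed here, and the record names and sequences are hypothetical. The sketch only shows how two records in one merged group can yield different values for the same measurement.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    counted = [base for base in seq.upper() if base in "ACGT"]
    if not counted:
        raise ValueError("sequence has no unambiguous bases")
    return sum(base in "GC" for base in counted) / len(counted)


def wallace_tm(seq: str) -> float:
    """Wallace-rule melting temperature in degrees Celsius.

    Tm = 2 * (A + T) + 4 * (G + C). This formula is an assumption made
    for illustration; the abstract does not name the one actually used.
    """
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2.0 * at + 4.0 * gc


def spread(values) -> float:
    """Difference between the largest and smallest value in a group."""
    values = list(values)
    return max(values) - min(values)


# Hypothetical merged group: two records treated as duplicates,
# where one sequence is a truncated copy of the other.
merged_group = {
    "REC_A": "ATGCGCGTATAGCGC",
    "REC_B": "ATGCGCGTATAGC",
}

gc_spread = spread(gc_content(s) for s in merged_group.values())
tm_spread = spread(wallace_tm(s) for s in merged_group.values())

print(f"GC content spread within group: {gc_spread:.3f}")
print(f"Wallace Tm spread within group: {tm_spread:.1f} degrees C")
```

A non-zero spread within a merged group means that the answer to a downstream query depends on which of the duplicate records happens to be retrieved, which is the kind of inconsistency the abstract describes.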
Volume: 2017
Publication date: 2017